Abstract



Differential privacy provides the first theoretical foundation with a provable privacy guarantee
against adversaries with arbitrary prior knowledge. The main idea to achieve differential privacy
is to inject random noise into statistical query results. Besides correctness, the most important
goal in the design of a differentially private mechanism is to reduce the effect of random noise,
ensuring that the noisy results can still be useful.

This paper proposes the compressive mechanism, a novel solution based on a state-of-the-art
compression technique called compressive sensing. Compressive sensing is a decent
theoretical tool for compact synopsis construction, using random projections. In this paper,
we show that the amount of noise is significantly reduced from $O(\sqrt{n})$ to $O(\log(n))$ when the
noise insertion procedure is carried out on the synopsis samples instead of the original database.
As an extension, we also apply the proposed compressive mechanism to solve the problem of
continual release of statistical results. Extensive experiments using real datasets justify our
accuracy claims.

1 Introduction

No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to
attacks upon his honor and reputation.

Universal Declaration of Human Rights

Rapid advances in information technology and computational capacity are raising concerns 
regarding the privacy of sensitive personal information. Previous work has shown that even after 
removing all personal identity attributes such as names and addresses, adversaries may still be able 
to identify specific individuals by combining their prior knowledge with the published attributes in 
the database, e.g., age, gender and race. For example, in 2007, a team of researchers successfully 
reidentified two customers in the "anonymized" Netflix data set [1], based on their transaction 
histories with Netflix and movie comments on IMDB [37]. Moreover, recent studies show that group
statistics are also vulnerable to privacy attacks due to inference techniques combining anonymized 
data and existing and/or public information. In Genome Wide Association Studies (GWAS), 
for instance, the DNA samples from participating patients are mixed to prevent disclosure of their 
identities, and only statistics regarding the prevalence of particular single-nucleotide polymorphisms
(SNPs) are published. However, an adversary can still verify the presence of a specific person in
the mixture with high confidence, given a DNA sample from that person and a reference DNA
mixture [29], [45]. Such privacy problems have become a major obstacle to biomedical research, as
they make useful input data increasingly difficult to obtain.

To tackle these problems, numerous privacy protection frameworks have been proposed. Among
these, differential privacy outshines others in providing a strong robustness guarantee against attacks
from adversaries with arbitrary prior knowledge. Simply put, a randomized mechanism (query
answering method) for answering statistical queries satisfies differential privacy if and only if the
query results will be almost identical after modifying or deleting one of the records in the database
[21]. This requirement ensures that sensitive information in a record can hardly be inferred,
even if the adversary knows the values of all the other records in the database.

To achieve differential privacy, one basic mechanism is to add random noise to the statistical
query results [21]. In particular, given the unperturbed query result and a parameter $\epsilon$ called
the privacy budget, the mechanism randomly selects a number following a Laplacian distribution,
with mean at the original query result and scale proportional to the sensitivity (discussed later) of
the queries times $1/\epsilon$. Theoretical analysis shows that the resulting randomness in the query result
fulfills the requirements of differential privacy with respect to privacy budget $\epsilon$. While this basic
mechanism handles a single query quite well, it is not effective for processing a large number of 
queries, as it requires a privacy budget linear in the number of queries to maintain a given level of 
result accuracy. Thus queries may quickly exhaust the total privacy budget assigned to a database, 
forcing the database to be taken offline to avoid unacceptable privacy breaches. In such situations, 
a better alternative is to build a compact, privacy-preserving synopsis from the data with a fixed 
amount of privacy budget, such that it is capable of answering all possible queries. 

There have been several previous efforts to design and construct such synopses, e.g., wavelets
[36], trees [28] and linear summation basis [30] ISS [12]. These synopsis structures are lossless,
meaning that they preserve all information in the original database. In other words, the original 
statistics can be completely and accurately recovered by running the corresponding decoding algo- 
rithms on the synopsis structures. Consequently, the size of these synopses, as well as the privacy 
budget they require, grow linearly with the size of the dataset. 

This paper explores a new direction: probabilistic synopses based on compressive sensing [17],
[8], [10]. Figure 1 illustrates this new compressive sensing mechanism. Using a sparse representation
of the original data, we use compressive sensing to encode a very small synopsis, compared to 
the original database size. We then add Laplacian noise to the synopsis, making it differentially 
private; decode the synopsis, creating a noisy version of the original data; then answer an unlimited 
number of queries over the decoded data, without adding additional noise. The compressive sensing 
mechanism allows us to use less noise than previous synopsis proposals under certain conditions, 
and provides much more accurate statistical query results after decoding. Unlike previous methods 
that focus on specific classes of queries, the compressive sensing mechanism is universal, supporting 
all possible queries on the decoded noisy data. Thus the compressive sensing mechanism can be 
seamlessly incorporated into any applications with privacy concerns, from GWAS analysis to user 
transaction history mining. We show that the compressive mechanism improves the accuracy of 
the result statistics by up to an order of magnitude, in both theoretical analysis as well as empirical 
studies. 

The organization of the paper is as follows. Section 2 summarizes existing studies of differential
privacy and compressive sensing. Section 3 introduces the notation and the preliminaries of compressive
sensing and differential privacy. Section 4 introduces universal mechanisms and the basic definitions
of our target problems. Section 5 presents the compressive mechanism and analyzes its error. Section 6
generalizes the compressive mechanism to a streaming environment, for continual release of statistics.
Section 7 demonstrates the usefulness of these new methods through empirical studies with real data sets.
Finally, Section 8 concludes the paper and outlines interesting research directions for future work.

Figure 1: Compressive mechanism framework.

2 Related Work 

This section gives an overview of work on differential privacy and compressive sensing. 



2.1 Differential Privacy 

The notion of $\epsilon$-differential privacy was introduced in [21], [18]. Alternative definitions soon appeared,
such as $(\epsilon, \delta)$-differential privacy [20l HQ], pan-privacy [24], [23] and zero-knowledge privacy [25].
$(\epsilon, \delta)$-differential privacy relaxes $\epsilon$-differential privacy, while pan-privacy and zero-knowledge privacy
are stronger than $\epsilon$-differential privacy and apply to special circumstances. This work uses the more
popular $\epsilon$-differential privacy definition. A comprehensive survey of the development
of differential privacy is given in [19].

Considerable efforts have been devoted to the design of $\epsilon$-differentially private mechanisms, but
the majority of them only deal with linear counting queries, e.g., [HI dgl [27] [Ml [23 [23 [2H]. In
contrast, our compressive sensing mechanism can handle arbitrary statistical queries. In particular,
[46] uses a Haar wavelet transform and [28] designs a tree structure for answering range counting
queries. However, neither the Haar wavelet transform nor the tree structure can be generalized to
answer other queries in a private and accurate way. [46] and [28] are merged into the unified framework
of [30] for answering linear counting queries. A recent work [Ij] separates counting queries from
arbitrary low-sensitivity queries and essentially shows that counting is much easier than arbitrary
low-sensitivity queries. [39], [3] use (deterministic) Fourier transforms for problems that are very
different from ours, while we take advantage of a probabilistic transform in compressive sensing to
reduce the dimension of data from $O(n)$ to $O(\log n)$.



2.2 Compressive Sensing 

Compressive sensing was introduced by [U [TTl [TT] , and later shown to have extensive applications 
in imaging [4H [33], [6], signal processing [33], computational biology [32], geophysical data analysis
[31], communications [13] and so on. To the best of our knowledge, we are the first to apply
compressive sensing to sensitive data analysis. 

(JHt Wi\ [32] apply random projections to differential privacy. They show that the compressed 
data can be used for certain statistical tasks, and do not consider the reconstruction process of 
compressive sensing. [13] reconsiders the contingency table release problem [3] for sparse data 
(without any transformation), without using compressive sensing. [22] is the most relevant paper in 
the sense that it works in the opposite direction of ours: applying privacy to compressive sensing. 

3 Preliminaries 

3.1 Notation 

We use $\mathbb{R}$ ($\mathbb{R}^+$) to denote the set of real numbers (positive real numbers) and $[1,n]$ to represent
the integer set $\{1, 2, \ldots, n\}$. For two vectors $v, w \in \mathbb{R}^n$, $\langle v, w \rangle$ means their inner product. The
$p$-norm ($p$ is a positive integer) of a vector $v \in \mathbb{R}^n$ is defined to be $(\sum_{i=1}^{n} |v[i]|^p)^{1/p}$ and is denoted
by $\|v\|_p$. For example, the 2-norm of the difference of two points gives their Euclidean distance.
We write $\log$ for $\log_2$. The number $e$ is the base of natural logarithms. We let $Lap(\lambda)$ denote the
one-dimensional Laplacian distribution centered at 0 with scale $\lambda$ and the corresponding density
function $g(x) = \frac{1}{2\lambda} e^{-|x|/\lambda}$. The composition of functions $g$ and $h$ is denoted $g \circ h$, meaning that we
first apply $g$ to the input and then $h$ to the output of $g$.

We use other notation common in theoretical computer science. When describing asymptotic
complexity, we use $\tilde{O}$ (pronounced soft-O) as a variant of $O$ (pronounced big-O) that ignores
logarithmic factors. For instance, if the complexity is $O(n/\log n)$, we simply write $\tilde{O}(n)$. Also,
$\exp(n)$ means $e^{Cn}$ for some constant $C$. With high probability means with probability at least 0.99.

3.2 Compressive Sensing 

This section gives a brief overview of the theory of compressive sensing, and we refer the reader to 
an excellent survey [8] for more information. As shown in Figure 1, compressive sensing consists of a
probabilistic compression procedure, also called the sampling process, followed by a reconstruction
process that decodes the compressed data. The sampling process reduces the data size from $O(n)$ to
$O(\log n)$. The rather complex decoding process exactly or approximately reconstructs the original
data from the compressed samples. Readers who are not interested in the mathematics underlying 
the compressive sensing technique should skip to the next section. 

In what follows, all vectors are over $\mathbb{R}^n$ unless otherwise noted. Consider a vector $D$ that
we wish to represent using an orthonormal basis (such as a standard basis, or a wavelet basis)
$\Psi = [\psi_1, \ldots, \psi_n]$. Let $x = (x[1], \ldots, x[n])$ be the coefficient sequence of $D$ under the new basis
$\Psi$. Then we have:

\[ D[j] = \sum_{i=1}^{n} x[i] \, \psi_i[j]. \]

If we treat $\Psi$ as an $n \times n$ matrix with $\psi_1, \ldots, \psi_n$ as the columns, $D$ can be written as $\Psi x$, and
we say that $x$ represents $D$ under the new basis. We call $x$ $S$-sparse if it has at most $S$ nonzero
entries. Let $x_S$ be obtained from $x$ by replacing its $n - S$ coefficients with smallest absolute value
by 0. Then $x_S$ is $S$-sparse.
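
The definitions above can be illustrated with a minimal Python sketch. It assumes a Haar wavelet basis as the orthonormal basis $\Psi$ and a toy piecewise-constant vector $D$; the helper names `haar_basis` and `best_s_term` are hypothetical and chosen only for this example.

```python
import numpy as np

def haar_basis(n):
    """Return an n x n matrix Psi whose columns are orthonormal Haar vectors.

    Illustrative helper; assumes n is a power of two.
    """
    H = np.array([[1.0]])
    while H.shape[0] < n:
        m = H.shape[0]
        top = np.kron(H, [1.0, 1.0])              # coarser-scale vectors, stretched
        bottom = np.kron(np.eye(m), [1.0, -1.0])  # finest-scale differences
        H = np.vstack([top, bottom]) / np.sqrt(2.0)
    return H.T  # rows of H are basis vectors, so they become the columns of Psi

def best_s_term(x, S):
    """Return x_S: keep the S largest-magnitude entries of x, set the rest to 0."""
    x = np.asarray(x, dtype=float)
    x_s = np.zeros_like(x)
    if S > 0:
        keep = np.argsort(np.abs(x))[-S:]
        x_s[keep] = x[keep]
    return x_s

# Toy data: a piecewise-constant vector is sparse under the Haar basis.
D = np.array([5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0])
Psi = haar_basis(8)
x = Psi.T @ D             # coefficients of D under Psi (Psi is orthonormal)
x_S = best_s_term(x, 2)   # 2-sparse approximation of x
D_approx = Psi @ x_S      # map back to the original domain
print(np.round(x, 3))
print(np.round(D_approx, 3))
```

In this toy case only two Haar coefficients are nonzero, so the 2-sparse approximation reproduces $D$ up to floating-point error; real data would typically be compressible rather than exactly sparse.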

Often data are compressible. Given a constant $0 < p < 1$, we say vector $x$ is ($p$-)compressible
with magnitude $R$ if its components taken in sorted order obey $|x_{(i)}| \le R \cdot i^{-1/p}$, $\forall i \in [1,n]$. A
compressible vector $x$ can be well approximated by an $S$-sparse vector in the sense that
$\|x - x_S\|_1 \le C_p \cdot R \cdot S^{1-1/p}$ for some constant $C_p$. A vector $D$ has an ($S$-)sparse representation if there is an
orthonormal basis $\Psi$ (called a sparse basis) where $D$'s representation $x$ is sparse or compressible,
i.e., $x$ is $S$-sparse or is compressible to an $S$-sparse vector $x_S$.

The input to compressive sensing is a vector $D$ with a sparse representation. Then there is a
sampling process which can be characterized as a linear mapping. We use a matrix $\Phi \in \mathbb{R}^{k \times n}$ to
describe the sampling operator and the result is a vector $y = \Phi D \in \mathbb{R}^k$. Candes and Tao p^
define the $r$-th restricted isometry constant $\delta_r$ of $\Phi$ to be the smallest number such that

\[ (1 - \delta_r) \|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_r) \|x\|_2^2, \]

for all $r$-sparse vectors $x \in \mathbb{R}^n$. The condition that $\delta_{2S} < 1$ implies that all pairwise distances
between $S$-sparse signals must be well-preserved in the measurement space, i.e.,

\[ (1 - \delta_{2S}) \|x_1 - x_2\|_2^2 \le \|\Phi x_1 - \Phi x_2\|_2^2 \le (1 + \delta_{2S}) \|x_1 - x_2\|_2^2, \]

for all $S$-sparse vectors $x_1, x_2$. We loosely say that a matrix $\Phi$ satisfies the Restricted Isometry
Property (RIP) if it has $\delta_{2S} < 1$. Such $\Phi$ always exists. For example, with probability $1 - \exp(-k)$,
a random matrix $\Phi$ formed by sampling independent and identically distributed (i.i.d.) entries from
a symmetric Bernoulli distribution, more precisely

\[ Prob\left(\Phi(i,j) = \pm \frac{1}{\sqrt{k}}\right) = \frac{1}{2}, \]

satisfies RIP, provided that

\[ k \ge C \cdot S \log(n/S), \]

where $C$ is some constant [5]. In words, every entry of $\Phi$ equals $\frac{1}{\sqrt{k}}$ with probability 1/2 and
equals $-\frac{1}{\sqrt{k}}$ with probability 1/2. In the following discussion, we will assume that $\Phi \in \mathbb{R}^{k \times n}$ is in
such a form. It is not hard to verify that $A = \Phi\Psi \in \mathbb{R}^{k \times n}$ also satisfies RIP with overwhelming
probability $1 - \exp(-k)$, where $k = \Theta(S \log(n/S))$, for any fixed orthonormal basis $\Psi \in \mathbb{R}^{n \times n}$ [8].
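
As a rough illustration of the sampling operator, the following Python sketch draws a symmetric Bernoulli matrix and empirically checks how well it preserves the squared norm of random sparse vectors. The sizes `n`, `k`, `S` and the helper name `bernoulli_matrix` are assumptions made for this example. The check only samples a few sparse vectors, so it is not a certificate of RIP, which quantifies over all sparse vectors.

```python
import numpy as np

def bernoulli_matrix(k, n, rng):
    """k x n matrix with i.i.d. entries +1/sqrt(k) or -1/sqrt(k), each w.p. 1/2."""
    signs = rng.choice([-1.0, 1.0], size=(k, n))
    return signs / np.sqrt(k)

rng = np.random.default_rng(0)
n, k, S = 1000, 200, 5          # illustrative sizes, not taken from the paper
Phi = bernoulli_matrix(k, n, rng)

# Ratio ||Phi x||_2^2 / ||x||_2^2 for random S-sparse vectors x.
ratios = []
for _ in range(100):
    x = np.zeros(n)
    support = rng.choice(n, size=S, replace=False)
    x[support] = rng.normal(size=S)
    ratios.append(np.sum((Phi @ x) ** 2) / np.sum(x ** 2))
ratios = np.array(ratios)

# Ratios close to 1 are consistent with a small restricted isometry constant.
print(ratios.min(), ratios.max())
```

Because every entry has magnitude $1/\sqrt{k}$, each column of the matrix has unit Euclidean norm, which is the sense in which the matrix is normalized.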
Up to now, we have clarified the input and the sampling process, obtaining a sample vector
$y = \Phi D \in \mathbb{R}^k$ with $k = \Theta(S \log(n/S))$. The next and final step is to reconstruct the vector $D$ from
$y$ through the sparse representation of $D$. The samples may be contaminated with an unknown
noise $e \in \mathbb{R}^k$, and the sample vector becomes

\[ y^* = y + e = \Phi D + e = Ax + e, \]

where $A = \Phi\Psi$ is known from the sampling process. Candes, Romberg and Tao [7] prove the
remarkable and surprising result that by solving a combinatorial optimization problem, the recovered
answer $x^* \in \mathbb{R}^n$ can be close enough to $x$ even in the presence of unknown perturbations. Needell
and Tropp [38] prove essentially the same error bound using a greedy algorithm. The details of the
two algorithms can be found in the appendix and the result is summarized as follows.

Lemma 1 ([7], [38]) Suppose $A$ satisfies RIP and $\|e\|_2 \le \theta$. Then

\[ \|x - x^*\|_2 \le C_2 \cdot \frac{\|x - x_S\|_1}{\sqrt{S}} + C_3 \cdot \theta, \]

for some constants $C_2$ and $C_3$.
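
To make the reconstruction step concrete, the sketch below recovers a sparse vector from clean and from noisy samples. It uses Orthogonal Matching Pursuit (OMP), a simple greedy method chosen here only as a stand-in; it is not the algorithm of [7] or of [38], whose details are deferred to the appendix. The sizes, the test vector and the function name `omp` are assumptions for this example.

```python
import numpy as np

def omp(A, y, S):
    """Orthogonal Matching Pursuit: greedy S-sparse fit of y ~ A x (stand-in decoder)."""
    n = A.shape[1]
    support = []
    coef = np.zeros(0)
    residual = np.array(y, dtype=float)
    for _ in range(S):
        # Pick the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j in support:  # no new column helps; stop early
            break
        support.append(j)
        # Least-squares refit on the selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(1)
n, k, S = 512, 128, 3                                   # illustrative sizes
A = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)   # Psi = identity, so A = Phi
x = np.zeros(n)
x[[10, 200, 400]] = [4.0, -3.0, 2.0]                    # exactly 3-sparse, so x = x_S

x_clean = omp(A, A @ x, S)                # noiseless samples
e = rng.normal(scale=0.01, size=k)        # unknown perturbation with ||e||_2 = theta
x_noisy = omp(A, A @ x + e, S)

theta = np.linalg.norm(e)
err_clean = np.linalg.norm(x - x_clean)
err_noisy = np.linalg.norm(x - x_noisy)
print(err_clean, err_noisy, theta)
```

Since the test vector is exactly sparse, the term involving $\|x - x_S\|_1$ vanishes and the recovery error is expected to scale with the noise norm $\theta$, in the spirit of Lemma 1.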



For a compressible vector $x \in \mathbb{R}^n$, we mentioned that $\|x - x_S\|_1 \le C_p \cdot R \cdot S^{1-1/p}$ for some
$p \in (0, 1)$. Also note that $D$ can be recovered by setting $D^* = \Psi x^*$ and that $\|D - D^*\|_2 = \|x - x^*\|_2$.
We can derive the following corollary for our purpose.

Corollary 1 Suppose that $A$ satisfies RIP and $\|e\|_2 \le \theta$. Then $\|D - D^*\|_2 = O(S^{1/2 - 1/p} + \theta)$ for
some constant $S \ll n$ and $p \in (0, 1)$.

If the input vector $D \in \mathbb{R}^n$ has no sparse representation, then the $D^* \in \mathbb{R}^n$ obtained by
compressive sensing could be very far away from $D$, and that is why we require the input to have
a sparse representation. Formally,

Corollary 2 Suppose $A$ satisfies RIP and $\|e\|_2 \le \theta$. If $D$ has no sparse representation, then
$\|D - D^*\|_2 = O(n^{1/2} + \theta)$.

Finally, we remark that the time complexity and the space complexity of compressed sensing
are both $O(n)$.

3.3 Differential Privacy 

In this paper, we represent a database as a vector $D \in \mathbb{R}^n$. This abstraction encompasses many
previous abstractions of data, such as a data distribution [20], histogram [27], contingency table [3],
private bits [16], the database itself, and recommendation systems [34]. Two databases $D_1, D_2 \in \mathbb{R}^n$
are said to be neighboring iff $\|D_1 - D_2\|_1 \le 1$. The notion of differential privacy is defined as
follows.

Definition 1 ([21], [18]) A randomized mechanism $\mathcal{K}$ provides $\epsilon$-differential privacy if for all
neighboring databases $D_1, D_2 \in \mathbb{R}^n$ and all $Sub \subseteq Range(\mathcal{K})$,

\[ Prob(\mathcal{K}(D_1) \in Sub) \le e^{\epsilon} \times Prob(\mathcal{K}(D_2) \in Sub), \]

where the probability space in each case is over the coin flips of $\mathcal{K}$.

A popular mechanism for achieving $\epsilon$-differential privacy is the Laplacian mechanism [21], which
can be used when the output of the mechanism is numeric. McSherry and Talwar developed a
technique called the exponential mechanism [35] for problems where the output is non-numeric.

Laplacian mechanism: The sensitivity of a query $Q : \mathbb{R}^n \to \mathbb{R}^d$ is defined to be

\[ \Delta_Q = \max_{D_1, D_2} \|Q(D_1) - Q(D_2)\|_1, \]

where the maximum is taken over all neighboring $D_1, D_2 \in \mathbb{R}^n$. [21] shows the following result:

Lemma 2 ([21]) For $Q : \mathbb{R}^n \to \mathbb{R}^d$, the mechanism $\mathcal{K}_Q$ that adds independently generated noise
with distribution $Lap(\Delta_Q/\epsilon)$ to each of the $d$ output values provides $\epsilon$-differential privacy.
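
A minimal Python sketch of the Laplacian mechanism of Lemma 2 follows. The function name `laplace_mechanism`, the toy database and the choice $\epsilon = 0.5$ are assumptions made for illustration. The example applies the mechanism to the identity query, whose sensitivity is 1 because neighboring databases differ by at most 1 in $L_1$ distance.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Add i.i.d. Lap(sensitivity / epsilon) noise to every output value."""
    true_answer = np.asarray(true_answer, dtype=float)
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=true_answer.shape)
    return true_answer + noise

rng = np.random.default_rng(2)
D = np.full(100_000, 10.0)   # toy database vector (hypothetical values)
epsilon = 0.5

# Identity query Q(D) = D has sensitivity 1 under the L1 neighboring relation.
D_noisy = laplace_mechanism(D, sensitivity=1.0, epsilon=epsilon, rng=rng)

# The mean absolute value of Lap(b) noise is b, here 1 / 0.5 = 2.
mean_abs_noise = np.mean(np.abs(D_noisy - D))
print(mean_abs_noise)
```

Note that the noise is added to all $n$ coordinates independently, which is why the total $L_2$ error of this approach grows with the database size.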

Exponential mechanism: This mechanism is for the case where the query answer $y$ is not
numerical. We rely on a pre-defined utility function $u(D, y)$ (with a numeric output) to measure the
quality of $y$, compared to the exact answer. The exponential mechanism outputs $y$ with probability
proportional to $\exp\left(-\frac{\epsilon \cdot u(D, y)}{2 \Delta_u}\right)$, where $\Delta_u$ is the sensitivity of the utility function $u(D, y)$. The
exponential mechanism provides $\epsilon$-differential privacy [35]. The distance of an answer from the best
answer, which has the smallest $u$, exhibits an exponential tail and with probability almost 1, the
exponential mechanism outputs an object with an approximately optimal value.
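
The exponential mechanism can be sketched in a few lines of Python. The sketch assumes the commonly used weighting $\exp(-\epsilon \cdot u / (2\Delta_u))$ with smaller $u$ being better, as in the convention above; the exact constant is an assumption, and the function name, candidates and utility values are hypothetical.

```python
import numpy as np

def exponential_mechanism(candidates, utilities, sensitivity, epsilon, rng):
    """Sample a candidate with probability proportional to exp(-eps * u / (2 * sensitivity)).

    Smaller utility values u are treated as better, matching the text's convention.
    """
    u = np.asarray(utilities, dtype=float)
    logits = -epsilon * u / (2.0 * sensitivity)
    logits -= logits.max()                 # stabilize before exponentiating
    probs = np.exp(logits)
    probs /= probs.sum()
    choice = candidates[rng.choice(len(candidates), p=probs)]
    return choice, probs

rng = np.random.default_rng(3)
candidates = ["a", "b", "c"]               # hypothetical non-numeric answers
utilities = [0.0, 2.0, 4.0]                # "a" is the best answer (smallest u)
choice, probs = exponential_mechanism(
    candidates, utilities, sensitivity=1.0, epsilon=1.0, rng=rng
)
print(choice, np.round(probs, 3))
```

With these toy values the sampling weights are proportional to $1$, $e^{-1}$ and $e^{-2}$, so the best candidate is the most likely output but every candidate keeps nonzero probability.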

Continual mechanism and pan-privacy: For the case where the database is updated over time,
the theory community has investigated what they call differential privacy under continual
observation [23]. In their setting, the input is no longer a static vector, but instead a stream of 0's and
1's, denoted by $\sigma \in \{0,1\}^T$, where $T$ is an upper limit of time. The continual mechanism [23]
receives an input $\sigma[t] \in \{0, 1\}$ at each time $t \in [1, T]$, and outputs an approximation to the number
of 1's seen in the length-$t$ prefix of the stream. Two streams' prefixes $\sigma_t, \sigma'_t \in \{0, 1\}^t$, $t \in [1, T]$,
are neighboring iff $\|\sigma_t - \sigma'_t\|_1 \le 1$. The definition of $\epsilon$-differential pan-privacy under continual
observation [24], [23] is stronger than $\epsilon$-differential privacy and is stated as follows.



Definition 2 ([24], [23]) Let $I_{\mathcal{K}}$ denote the set of internal states of the randomized mechanism
$\mathcal{K}$. $\mathcal{K}$ provides $\epsilon$-differential pan-privacy (against a single intrusion) if for all neighboring stream
prefixes $\sigma_t, \sigma'_t \in \{0, 1\}^t$, $t \in [1, T]$ and for all sets $I' \subseteq I_{\mathcal{K}}$ and $Sub \subseteq Range(\mathcal{K})$,

\[ Prob(\mathcal{K}(\sigma_t) \in (I', Sub)) \le e^{\epsilon} \times Prob(\mathcal{K}(\sigma'_t) \in (I', Sub)), \]

where the probability space in each case is over the coin flips of $\mathcal{K}$.

The details of the continual mechanism are included in the appendix. The continual mechanism
can be easily generalized to the case in which the input is $D \in \mathbb{R}^T$. If we denote the approximate sum of
the first $t$ entries by $S^*_t$ and the true sum of the first $t$ input items as $S_t$, the result of [23] can be
formalized (in a slightly different way) as follows.

Lemma 3 ([23]) The continual mechanism provides $\epsilon$-differential pan-privacy. At each time
$t \in [1, T]$, with probability at least $1 - \beta$, $|S^*_t - S_t| = O(\log(1/\beta) \log^{1.5}(T)/\epsilon)$.

4 Problem Definition 

This section formally defines the problems investigated in this paper. Specifically, we focus on
$\epsilon$-differentially private randomized mechanisms with numeric outputs, which can be formalized as
$\mathcal{K} : \mathbb{R}^n \to \mathbb{R}^d$. Such randomized mechanisms can be used to answer (numeric) statistical queries (in
the form of $Q : \mathbb{R}^n \to \mathbb{R}^d$) about a database. In general, a randomized mechanism for publishing
query results about a database must resolve a trade-off between utility and privacy. Utility means
that the outputs should not be too far away from the true answers of the query, to ensure that the
perturbed answers can still be helpful to users. Privacy requires that the outputs not be too near
to the true answers, since some amount of random perturbation is essential for the mechanism to
be $\epsilon$-differentially private. While many previous studies mainly deal with linear counting queries,
in practice the queries are likely to be much more diverse and complicated, such as a multiphase
analysis task. A general randomized mechanism with respect to a certain kind of database should
be able to answer all possible queries with guarantees about utility and privacy. In other words,
given a database $D$ of a certain kind $\mathcal{X}$, any query $Q$ over $D$ has to be answered with reasonable
utility and privacy guarantees. For this purpose, this section defines the notion of a universal
mechanism. We begin with the notion of an identity query, which just returns the entire database
unchanged. Intuitively, the purpose of a universal mechanism is to accurately answer the identity
query, subject to differential privacy.

Definition 3 A universal mechanism with respect to a class $\mathcal{X}$ of databases is a randomized
mechanism $U_{\mathcal{X}} : \mathcal{X} \to \mathbb{R}^n$ for answering the identity query, satisfying the following conditions:

• $\epsilon$-differential privacy;

• with high probability, $\|D - D^*\|_2 = O(\log(n)/\epsilon)$, for any input $D \in \mathcal{X}$ and its corresponding
output $D^* \in \mathbb{R}^n$.



Table 1: Error bounds (in terms of $\|D - D^*\|_2$) for the identity query.

Mechanism                 Bound

Compressive Mechanism     $O(\log n/\epsilon)$
Laplacian Mechanism       $O(\sqrt{n}/\epsilon)$
[Ml [28] 0{^/e) 



A universal mechanism is a good base for answering all kinds of statistical queries. To answer
any query $Q$, we first apply $U_{\mathcal{X}}$ to a database $D \in \mathcal{X}$, obtaining $D^*$. Then we deterministically
compute an answer to $Q$ over $D^*$. Since $D^*$ must be very close to $D$, utility can be guaranteed for
any query.
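
The two-step procedure above can be sketched as follows. The vector `D_star` stands in for the output of a universal mechanism and its values are hypothetical; the query names and the helper `answer_queries` are likewise assumptions for this example. The point is only that every query is evaluated deterministically on the published $D^*$, with no further randomness.

```python
import numpy as np

def answer_queries(D_star, queries):
    """Deterministically evaluate each query on the already-published D*."""
    return {name: float(q(D_star)) for name, q in queries.items()}

# Placeholder for the output of a universal mechanism (hypothetical values).
D_star = np.array([12.3, 7.9, 0.2, 4.8, 9.1])

# Any number of queries, linear or not, can be asked of the same D*.
queries = {
    "range_count_0_2": lambda d: d[0:3].sum(),
    "mean": lambda d: d.mean(),
    "max_bin": lambda d: np.argmax(d),
    "sum_of_squares": lambda d: np.sum(d ** 2),
}

first = answer_queries(D_star, queries)
second = answer_queries(D_star, queries)   # asking again gives identical answers
print(first)
```

Because the raw database $D$ is never touched again after $D^*$ is produced, repeating or adding queries does not require any further noise.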

Another important property of a universal mechanism is that it allows us to answer an
unbounded number of queries without any concern for privacy budget issues. Previous $\epsilon$-differential
privacy mechanisms can only answer a finite number of statistical queries, due to the budget limit.
For example, if a mechanism answers two queries each in an $\epsilon$-differentially private way, then the
mechanism may have provided only $2\epsilon$-differential privacy overall. The universal mechanism does
not have this problem, as any system based on a universal mechanism always satisfies $\epsilon$-differential
privacy, no matter how many queries are asked.

Another advantage of universal mechanisms is that D* can be published in its entirety, thus 
supporting both interactive and non-interactive querying. For example, biologists who perform 
GWAS can simply publish allele frequencies, rather than answering queries about the frequencies. 
So a universal mechanism itself can be very useful, even without consideration of subsequent queries. 

We use $O(\log(n)/\epsilon)$ as the upper bound of error in the definition of universal mechanism. One
reason for this choice is that we conjecture that $\Omega(\log(n)/\epsilon)$ is the lower bound of error to satisfy
$\epsilon$-differential privacy. Previous lower bounds [27], [16], [T^ cannot be applied to our case, and therefore
we leave this conjecture as an open problem, elaborated further at the end of this article.

To summarize, a universal mechanism is resilient to any form and any number of statisti- 
cal attacks from any number and kind of malicious attackers, which makes it universally robust. 
Meanwhile, the statistical query results remain relatively accurate, as D* must be very close to D. 

The Laplacian mechanism is not a universal mechanism, as it introduces too much error.
Suppose we use it to answer the identity query for $D$, producing $D^*$. With high probability
$\|D - D^*\|_2 = \Theta(\sqrt{n}/\epsilon)$ (by using a Chernoff-like argument). Nor is any other known $\epsilon$-differentially
private randomized mechanism universal. The challenge, then, is to devise a universal mechanism.
In later sections we introduce the compressive mechanism and show that it is a universal
mechanism with respect to databases with a sparse representation. Table 1 compares the error bounds
of the compressive mechanism and other contenders. In the remainder of this section, we present
two example use cases for universal mechanisms.

Example 1 Biomedical researchers use human genome sequence data to determine the correlations
between particular diseases and combinations of SNPs. They perform statistical tests on the data
after computing preliminary estimates of the importance of the SNPs [45]. In the US, frequency
information for the SNPs in NIH-funded studies is no longer publicly available, due to privacy
concerns. A universal mechanism could allow publication of this information, which would be used
by many other researchers.

Example 2 Internet service providers (ISPs) share statistical data from their network traces, to
help detect anomaly events of large scale fS^. The identification of anomalies involves matrix
computations on the statistics, which are not generally supported by existing mechanisms for differential
privacy.

Figure 2: A universal mechanism for databases with updates (continual observation).

A universal mechanism can also be helpful for databases that get updated. The continual 
mechanism of [23] can only answer linear counting queries (more precisely, only range counting 
queries starting from time 1). We aim to establish a pan-private mechanism such that at each
time $t$, we can answer any query over the database states $(D[1], \ldots, D[t])$ we have seen so far. We
will realize this ambition by designing a mechanism for answering the identity query at each time 
t. The definition of the identity query in a dynamic setting is a straightforward generalization of 
identity queries in a static setting, and we omit it here. 

Figure 2 illustrates how a universal mechanism works under continual observation. When a
new tuple $D(4)$ comes into the database, an updating procedure is triggered to generate a
corresponding noisy tuple $D^*(4)$ under a universal mechanism. The noisy values calculated so far are
published continuously to the public, guaranteeing that any snapshot of the publication follows the
requirements of a universal mechanism.

Example 3 The continual observation setting is important for public health report publication. For
example, Singapore's Ministry of Health periodically publishes the number of patients with certain
chronic diseases. Treating each reporting period as a separate and unrelated database, and providing
$\epsilon$-differential privacy for each separate database, does not provide $\epsilon$-differential privacy for patients
across multiple reporting periods. To preserve the privacy of patients, it is important to design
mechanisms that can support periodic release of statistics over a long period of time.



5 Compressive Mechanism 

5.1 Mechanism Design 

The overall aim of the compressive mechanism is to answer the identity query for databases that
have a sparse representation, in an $\epsilon$-differentially private manner. Algorithm 1 summarizes the
compressive mechanism. The input to the compressive mechanism includes the privacy budget $\epsilon$
and a database $D$ whose sparse representation is a compressible vector $x$ under an orthonormal
basis $\Psi \in \mathbb{R}^{n \times n}$. Vector $x$ can be well approximated by its $x_S$, for some constant $S$. $\mathcal{X}$ is the set
of all databases with a sparse representation.

Applying the sampling operator $\Phi$ produces a sample vector $y = \Phi D \in \mathbb{R}^k$, where
$k = \Theta(S \log(n/S))$. We add random noise to each entry of $y$. That is to say, for every $y[i]$, we have
$y^*[i] = y[i] + e[i]$, where $e[i] \sim Lap(\sqrt{k}/\epsilon)$. Then $y^* = y + e$. We can recover a noisy $x^*$ from $y^*$
through the reconstruction process of compressed sensing. A noisy $D^*$ is obtained by $D^* = \Psi x^*$.
Finally, $D^*$ is output by the compressive mechanism.

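Algorithm 1 Compressive Mechanism

Input: privacy budget $\epsilon \in \mathbb{R}^+$, $D \in \mathcal{X}$ (possibly together with a sparse basis $\Psi \in \mathbb{R}^{n \times n}$).

Output: $D^* \in \mathbb{R}^n$.

1: Generate a (normalized) random matrix $\Phi \in \mathbb{R}^{k \times n}$ with i.i.d. symmetric Bernoulli entries.
2: Acquire the sample $y = \Phi D \in \mathbb{R}^k$.
3: Get a noisy sample $y^* = y + e \in \mathbb{R}^k$ for $e \in \mathbb{R}^k$ with i.i.d. $Lap(\sqrt{k}/\epsilon)$ entries.
4: Reconstruct $x^* \in \mathbb{R}^n$.
5: Output $D^* = \Psi x^*$.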
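
The five steps of Algorithm 1 can be sketched end to end in Python. This is an illustration under stated assumptions, not the authors' implementation: the reconstruction step uses Orthogonal Matching Pursuit as a simple stand-in decoder, the sparse basis is taken to be the identity, and the parameters `S` and `k` are passed in by hand rather than derived from $k = \Theta(S \log(n/S))$.

```python
import numpy as np

def omp(A, y, S):
    """Greedy S-sparse decoder (Orthogonal Matching Pursuit), used as a stand-in."""
    support, coef = [], np.zeros(0)
    residual = np.array(y, dtype=float)
    for _ in range(S):
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j in support:
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

def compressive_mechanism(D, Psi, epsilon, S, k, rng):
    """Sketch of Algorithm 1; S and k are hypothetical tuning parameters."""
    n = D.shape[0]
    # Step 1: normalized symmetric Bernoulli matrix. Each column has L1 norm
    # sqrt(k), so changing D by L1 distance 1 moves y by at most sqrt(k) in L1.
    Phi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)
    # Step 2: compressive samples.
    y = Phi @ D
    # Step 3: i.i.d. Lap(sqrt(k) / epsilon) noise on the k samples only.
    y_star = y + rng.laplace(loc=0.0, scale=np.sqrt(k) / epsilon, size=k)
    # Step 4: reconstruct the sparse coefficients from A = Phi Psi (OMP stand-in).
    x_star = omp(Phi @ Psi, y_star, S)
    # Step 5: map back to the data domain.
    return Psi @ x_star

rng = np.random.default_rng(4)
n, k, S = 512, 128, 3
Psi = np.eye(n)                          # assumption: D is sparse in the standard basis
D = np.zeros(n)
D[[10, 200, 400]] = [400.0, 300.0, 200.0]

D_star = compressive_mechanism(D, Psi, epsilon=1.0, S=S, k=k, rng=rng)
# Sanity check with a huge budget: the noise is negligible, so D* should match D.
D_near_exact = compressive_mechanism(D, Psi, epsilon=1e9, S=S, k=k, rng=rng)
print(np.linalg.norm(D_star - D), np.linalg.norm(D_near_exact - D))
```

The design point the sketch tries to convey is that noise is injected into only $k$ samples rather than all $n$ entries of the database; the formal privacy and accuracy guarantees are those proved in the paper, not properties established by this code.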



5.2 Analysis 

Next we analyze the compressive mechanism. First we prove that it satisfies $\epsilon$-differential privacy.

Lemma 4 The compressive mechanism satisfies $\epsilon$-differential privacy.

Proof: See the appendix. □

Next we show that $D^*$ is very close to $D$, thus ensuring utility.

Lemma 5 With high probability, $\|D - D^*\|_2 = O(\log(n)/\epsilon)$.

Proof: See the appendix. □

The two lemmas above lead immediately to the following theorem. 

Theorem 1 The compressive mechanism is a universal mechanism with respect to databases with 
a sparse representation. 

5.3 Discussion 

The compressive mechanism is $\epsilon$-differentially private for all $D \in \mathbb{R}^n$, and works especially well
in terms of error bounds for $D \in \mathcal{X}$. In other words, for any input $D \in \mathbb{R}^n$ we guarantee
$\|D^* - D\|_2 = O(n^{1/2}/\epsilon)$ (from Corollary 2), while for $D \in \mathcal{X}$, $\|D^* - D\|_2 = O(\log(n)/\epsilon)$ with
high probability. In short, the compressive mechanism is $\epsilon$-differentially private for all $D \in \mathbb{R}^n$ and
is a universal mechanism for databases with a sparse representation.

Consider the issue of choosing the right $S$, the sparsity parameter. $S$ may not be known in
advance, and we may have to choose the best (i.e., with least error) $S$ ourselves. $S$ depends on
the input data $D$, and since $k = \Theta(S \log(n/S))$ and $\|e\|_2 = O(k/\epsilon)$, $S$ also affects the encoding
(adding noise) results. Therefore, we have to choose $S$ in a differentially private way, and a natural
method to achieve this aim is the exponential mechanism. $S$ could be any element in $[1, n]$ (if we
require the compressive mechanism to work for all possible $D \in \mathbb{R}^n$), and for each possible $S$, we
define its utility function as

\[ u(D, S) = \frac{C_2 \|x - x_S\|_1}{\sqrt{S}} + \frac{C_4 S \log(n/S)}{\epsilon}, \qquad (1) \]

where $C_2$ and $C_4$ are (known) constants. The right hand side of (1) is the upper bound on the error
according to Lemma 1 (with $\theta$ instantiated). We calculate that the sensitivity of $u(D, S)$, $\forall S \in [1, n]$,
is $\Delta_{u(D,S)} = C_5/\sqrt{S}$, where $C_5$ is a constant. Accordingly, the exponential mechanism
outputs $S$ with probability proportional to $\exp\left(-\frac{\epsilon \cdot u(S)}{2 \Delta_{u(S)}}\right)$, where $u(S)$ stands for $u(D, S)$; satisfies
$\epsilon$-differential privacy; and ensures the near-optimality of $S$ with truly negligible failure probability.

If the total privacy budget is $\epsilon$, we may choose to distribute a small part (say, $0.1\epsilon$ or $0.01\epsilon$)
to the exponential mechanism. In this case, the analysis of the compressive mechanism remains
unchanged (recall that we analyze the error in an asymptotic way).

The compressive mechanism may involve the art of identifying a suitable orthonormal basis
$\Psi \in \mathbb{R}^{n \times n}$ under which $D \in \mathbb{R}^n$ can be sparse or compressible. This is a mature and profound area
in mathematics that has been explored for many years, and we refer the reader to an outstanding
book [15]. In this paper, we simply assume that $D \in \mathcal{X}$ has a sparse representation and its sparse
basis $\Psi \in \mathbb{R}^{n \times n}$ could be treated as part of the input. In fact, many natural data sets have a sparse
representation [8], and that is one reason why compressed sensing has been so widely used since its
invention.

The time and space complexity of the compressive mechanism are both $O(n)$.

6 Continual Observation 

This section focuses on an extension of the compressive mechanism, the compressive mechanism 
under continual observation, or CMCO for short. 

6.1 Mechanism Design 

As in the static case, we suppose that $X \subseteq \mathbb{R}^T$ is the set of all databases with a sparse representation. The overall aim of CMCO is to answer the identity query at each time $t$ in a differentially pan-private manner. The main steps of CMCO are summarized in Algorithm 2.

CMCO takes as input a constant $\varepsilon$, the parameter for differential pan-privacy; and (a prefix of) $D \in X$, which has a sparse representation $x \in \mathbb{R}^T$ under the sparse basis $\Psi \in \mathbb{R}^{T \times T}$. The input differs from the static case in that we do not receive all of $D$ at once, but one value at a time; more precisely, at each time $t$, we receive $D[t] \in \mathbb{R}$.

At each time $t$, we generate a new random vector $\Phi_t \in \mathbb{R}^k$ (recall that $k = \Theta(S\log(T/S))$, where $S$ is the sparsity parameter of $x$). Each entry of $\Phi_t$ is distributed according to a symmetric Bernoulli distribution; more concretely,

$$\Pr(\Phi_t[i] = \pm 1/\sqrt{k}) = 1/2.$$

Upon receiving $D[t]$, we apply $\Phi_t$ to it and get a new vector $u_t = \Phi_t D[t] \in \mathbb{R}^k$.

As a result, by time $t$, there are $t$ vectors $u_1, \ldots, u_t \in \mathbb{R}^k$, forming a matrix $M \in \mathbb{R}^{k \times t}$. The $i$-th row of $M$, $i \in [1, k]$, is called $m_i \in \mathbb{R}^t$. As $t$ grows, the size of $m_i$ grows correspondingly. We apply $k$ independent continual mechanisms, one to each $m_i$, to estimate the sum of the first $t$ entries in $m_i$ in an $\varepsilon$-differentially pan-private way. The $k$ independent continual mechanisms return a vector $v_t^* \in \mathbb{R}^k$, with each entry $v_t^*[i]$ representing the estimate of the sum of the first $t$ entries in $m_i$. We use $v_t^*$ as a resource for reconstruction and obtain a noisy $x_t^* \in \mathbb{R}^t$. We obtain $D_t^* \in \mathbb{R}^t$ by $D_t^* = \Psi_t x_t^*$, where $\Psi_t \in \mathbb{R}^{t \times t}$ is the orthonormal basis (of the corresponding $\Psi \in \mathbb{R}^{T \times T}$) in a space of smaller dimension.

We do not store $M \in \mathbb{R}^{k \times t}$ at all; we discuss it only for the analysis. What we store at each time $t$ is just what is stored for the $k$ $\varepsilon$-differentially pan-private continual mechanisms, namely $k$ noisy sums and some independent noise (see the appendix for further details).

Table 2: Error bounds (in terms of $\|D_t - D_t^*\|_2$) for answering the identity query at time $t$.

Mechanism             Bound
CMCO                  $O(\log^{1.5}(T)/\varepsilon)$
Continual Mechanism   $O(\sqrt{t}\,\log^{1.5}(T)/\varepsilon)$

At time $t$, CMCO outputs $D_t^* \in \mathbb{R}^t$, an estimate of $D_t \in \mathbb{R}^t$ (the first $t$ terms of $D \in \mathbb{R}^T$).

Algorithm 2 CMCO (Compressive Mechanism under Continual Observation)
Initial Input: privacy budget $\varepsilon \in \mathbb{R}^+$.
Input at time $t \in [1, T]$: $D[t] \in \mathbb{R}$.
Output at time $t \in [1, T]$: $D_t^* \in \mathbb{R}^t$.

1: Generate a (normalized) random vector $\Phi_t \in \mathbb{R}^k$ with i.i.d. symmetric Bernoulli entries.
2: Acquire the sample $u_t = \Phi_t D[t] \in \mathbb{R}^k$.
3: Get an estimate vector $v_t^* \in \mathbb{R}^k$, where each $v_t^*[i]$, $i \in [1, \ldots, k]$, is the output of a continual mechanism estimating $\sum_{j=1}^{t} m_i[j]$ in an $\varepsilon$-differentially pan-private way.
4: Reconstruct $x_t^* \in \mathbb{R}^t$.
5: Output $D_t^* = \Psi_t x_t^*$.
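
The data flow of steps 1 and 2, together with the row sums that step 3 estimates, can be sketched in Python as follows. This is an illustration under loud assumptions: the class name `CmcoSampler` and its method `observe` are hypothetical, the sums are kept exactly (the noise of the continual mechanisms is deliberately omitted), and the reconstruction of steps 4 and 5 is not shown. The sketch is therefore not private; it only shows that the $k$ row sums can be maintained without storing the matrix of samples.

```python
import math
import random


class CmcoSampler:
    """Maintain the k row sums of the sample matrix without storing it.

    NOTE: sums are exact here. CMCO would replace each running sum by the
    output of a pan-private continual mechanism; that noise is omitted.
    """

    def __init__(self, k, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        self.row_sums = [0.0] * k

    def observe(self, value):
        # Step 1: fresh vector with i.i.d. entries equal to +-1/sqrt(k).
        scale = 1.0 / math.sqrt(self.k)
        phi_t = [scale if self.rng.random() < 0.5 else -scale
                 for _ in range(self.k)]
        # Step 2: the sample u_t = phi_t * D[t] is the newest column.
        u_t = [p * value for p in phi_t]
        # Each row gains one entry; only its running sum is retained.
        for i in range(self.k):
            self.row_sums[i] += u_t[i]
        return u_t


sampler = CmcoSampler(k=4, seed=0)
columns = [sampler.observe(v) for v in [2.0, -1.0, 3.0]]
```

After the three calls, `sampler.row_sums` holds the exact sums of the three columns, which are the quantities that the $k$ continual mechanisms estimate with noise.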



6.2 Analysis 

In this section, we analyze the performance of CMCO, which needs to be both private and useful. 

Theorem 2 CMCO is $\varepsilon$-differentially pan-private and for each time $t \in [1, T]$, $\|D_t - D_t^*\|_2 = O(\log^{1.5}(T)/\varepsilon)$ with high probability.

Proof: See the appendix. $\Box$

6.3 Discussion 

First, what if we directly apply the continual mechanism to the problem? Suppose that at time $t$, we obtain the noisy prefix sums $S_1^*, \ldots, S_t^* \in \mathbb{R}$ in an $\varepsilon$-differentially pan-private way. Then we perform subtraction operations to get the desired output:

$$D_t^*[i] = S_i^* - S_{i-1}^*, \quad \forall i \in [2, t].$$

Obviously $D_t^*[1] = S_1^*$. Calculations show that with high probability,

$$\|D_t - D_t^*\|_2 = O(\sqrt{t}\,\log^{1.5}(T)/\varepsilon),$$

indicating that the performance is not as good as that of CMCO. The bounds of CMCO and the continual mechanism are compared in Table 2.
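
The subtraction step of this baseline is simple enough to state in a few lines of Python. The function name `differences` is hypothetical, and the input is assumed to be the list of (noisy) prefix sums released so far.

```python
def differences(prefix_sums):
    """Turn prefix sums S_1, ..., S_t into per-step estimates D[1..t]."""
    if not prefix_sums:
        return []
    estimates = [prefix_sums[0]]  # D[1] = S_1
    for i in range(1, len(prefix_sums)):
        # D[i] = S_i - S_{i-1}; the noise of both sums enters the estimate.
        estimates.append(prefix_sums[i] - prefix_sums[i - 1])
    return estimates


recovered = differences([3, 5, 5, 9])
```

Because every entry inherits noise from the prefix sums it is built from, the error accumulates over all $t$ entries, which is where the $\sqrt{t}$ factor in the bound comes from.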

Second, the time and space complexity of CMCO are both $O(t + \log(n))$ at each time $t \in [1, T]$.

Last but not least, due to pan-privacy concerns, we are not allowed to store the original input
values.

7 Experiments

This section illustrates the effectiveness of the compressive mechanism and CMCO by experimental results on real data sets. We employ the data sets social_network.txt, nettrace.txt, and search_logs.txt, contributed by Hay, Rastogi, Miklau and Suciu [28]. The data set social_network.txt is a graph derived from friendship relations in an online social network site; nettrace.txt is said to be collected at a university; and search_logs.txt comes from search query logs collected between 2004 and 2010. nettrace.txt consists of $2^{16} = 65536$ entries, while there are $2^{15} = 32768$ elements in search_logs.txt. social_network.txt has 11342 entries. We also use a real data set tcptrace.txt, which has 7865 entries. tcptrace.txt was collected by Berkeley and contains a 30-day trace of the TCP connections between their local network and the Internet.^1

7.1 Haar Basis 

In this part of the section, we discuss results on the data sets when employing the Haar basis in compressive sensing. The Haar basis gives the simplest of the commonly used wavelet transformations.
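
As an illustration of why the Haar basis suits data whose adjacent entries are close, the following Python sketch computes an orthonormal Haar transform of a short piecewise-constant vector. The function name `haar_transform` and the toy input are hypothetical, and the length is assumed to be a power of two; this is not the code used in the experiments.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar wavelet transform; length must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [float(v) for v in signal]
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (coarse part) and differences (detail part).
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2)
                    for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2)
                   for i in range(half)]
        coeffs[:length] = averages + details
        length = half
    return coeffs


coefficients = haar_transform([5, 5, 5, 5, 9, 9, 9, 9])
nonzero = sum(1 for c in coefficients if abs(c) > 1e-9)
```

The eight-entry input has a single jump, so only two of its eight Haar coefficients are nonzero, while the squared norm of the vector is preserved by the transform.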

7.1.1 Choosing a Good S 

During the discussion of the compressive mechanism, we mentioned that the exponential mechanism can be used to choose a good sparsity parameter $S$. In this section, we show the feasibility of the exponential mechanism by running it on the three real data sets social_network.txt, nettrace.txt and search_logs.txt. We set the total privacy budget to $\varepsilon = 1$ and allot $0.1\varepsilon = 0.1$ to the exponential mechanism. The remaining $0.9\varepsilon = 0.9$ is used to run the compressive mechanism. In addition, we set $\Psi \in \mathbb{R}^{n \times n}$ to be a Haar basis [2].



Figure 3(a) through Figure 3(c) show how the error changes as $S$ varies. Then we run the exponential mechanism 1000 times. Figure 3(d) through Figure 3(f) display the result of the exponential mechanism, explicitly supporting our claim that the exponential mechanism chooses a near-optimal $S$ with truly negligible failure probability.

These experiments show that the exponential mechanism works very well in practice and returns 
a near-optimal S. 

7.1.2 Compressive Mechanism 

In this section, we experimentally evaluate the performance of the compressive mechanism by comparing it with the Laplacian mechanism and the HRMS mechanism [28]. The sparse basis is still the Haar basis.

Figure 4 shows the performance comparison. Both the horizontal and vertical axes use a logarithmic scale. The horizontal axis denotes different choices of the parameter $\varepsilon$ for differential privacy, while the vertical coordinates are the errors, namely $\|D - D^*\|_2$ for the input $D$ and the output $D^*$. Overall, the Laplacian and HRMS mechanisms cannot compete with the compressive mechanism, and as $\varepsilon$ becomes smaller, the compressive mechanism's advantage becomes larger.

The exception is that for $\varepsilon = 1$, the Laplacian mechanism is better than the compressive mechanism. This is because our data set is small, and thus $\sqrt{n}$ is not much larger than $S\log(n/S)$. Combined with the reconstruction errors, the compressive mechanism is slightly worse than the Laplacian mechanism. As $n$ becomes larger, from 11342 in Figure 4(a), to $2^{15} = 32768$ in Figure 4(b), to $2^{16} = 65536$ in Figure 4(c), the advantage of the Laplacian mechanism for the case $\varepsilon = 1$ becomes smaller and smaller. For other choices of $\varepsilon$, the advantage of the compressive mechanism becomes larger and larger as $n$ increases.

^1 http://ita.ee.lbl.gov/html/contrib/LBL-CONN-7.html

Figure 3: Choosing the right $S$: the first three panels use CS (Compressive Sensing); the last three use EM (Exponential Mechanism).

Figure 4: A comparison of CM (Compressive Mechanism), LM (Laplacian Mechanism) and HRMS (HRMS Mechanism).

7.1.3 CMCO 

To demonstrate the strength of CMCO over the continual mechanism, we fix $\varepsilon$ and run the two algorithms on social_network.txt, search_logs.txt and nettrace.txt. Figure 5 shows that the continual mechanism has significantly lower utility than the compressive mechanism under continual observation. The x-axis varies the time $t$ and the y-axis shows the corresponding error, namely $\|D_t - D_t^*\|_2$. The sparse basis is still the Haar basis.

7.2 Cosine Basis 

The previous experiments used the Haar basis as the orthonormal basis $\Psi \in \mathbb{R}^{n \times n}$. This is an appropriate choice because the three data sets of Hay, Rastogi, Miklau and Suciu [28] have the property that every two adjacent elements in a file are very close to each other. To illustrate the power of the compressive mechanism, here we employ a different orthonormal basis, the cosine basis [3], on a time-related data set tcptrace.txt. Figure 6 shows that the compressive mechanism far outperforms the Laplacian mechanism and the HRMS mechanism.

Figure 5: A comparison of CMCO (Compressive Mechanism under Continual Observation) and ContM (Continual Mechanism).

8 Concluding Remarks 

We have introduced the compressive mechanism as a means of realizing the idea of a universal mechanism. We have provided theoretical bounds and experimental results for the compressive mechanism, and shown how to apply it to the case of continual observation.

As mentioned earlier, one open problem concerns the lower bound of error in the definition of the universal mechanism. Formally, assume that there is an $\varepsilon$-differentially private mechanism to answer the identity query. Then what is the lower bound of $\|D - D^*\|_2$, where $D^* \in \mathbb{R}^n$ is the output of the mechanism? Here $X$ can be a subset of $\mathbb{R}^n$, such as databases with a sparse representation.

Regarding future work: are there applications of the compressive mechanism to other scenarios beyond continual observation? What if we use the compressive mechanism to satisfy other privacy definitions such as $(\varepsilon, \delta)$-differential privacy or zero-knowledge privacy? How can one incorporate considerations such as integrity and consistency constraints into our framework?

A Proof of Lemma 4

First let us focus on the sampling process, which can be characterized as a random projection $g : X \to \mathbb{R}^k$. For any $D \in X$, $g(D) = y = \Phi D$. We have mentioned that the random matrix $\Phi$ is formed by sampling i.i.d. entries from a symmetric Bernoulli distribution; more precisely,

$$\Pr(\Phi(i, j) = \pm 1/\sqrt{k}) = 1/2.$$

The sensitivity of $g$ is $\Delta_g = \sqrt{k}$. In the compressive mechanism, we make use of a Laplacian mechanism $\mathcal{K}_g : X \to \mathbb{R}^k$, which is $\varepsilon$-differentially private according to Lemma 2.

The subsequent reconstruction process of compressed sensing is a deterministic process and 
does not involve probability. Thus the whole compressive mechanism is e-differentially private. 
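
The sampling and noise-insertion steps of this argument can be sketched in Python as follows. The helper names (`bernoulli_matrix`, `laplace`, `noisy_samples`) are hypothetical, and the reconstruction step is not shown. The sketch relies on two facts: every column of the random matrix has $L_1$ norm exactly $\sqrt{k}$, which is the sensitivity used for the noise scale, and the difference of two i.i.d. exponential variables with mean $b$ follows $\mathrm{Lap}(b)$.

```python
import math
import random


def bernoulli_matrix(k, n, rng):
    """k x n matrix with i.i.d. entries equal to +-1/sqrt(k)."""
    scale = 1.0 / math.sqrt(k)
    return [[scale if rng.random() < 0.5 else -scale for _ in range(n)]
            for _ in range(k)]


def laplace(scale, rng):
    """Draw from Lap(scale) as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_samples(phi, data, eps, rng):
    """Return phi @ data plus Laplace noise calibrated to sensitivity sqrt(k)."""
    k = len(phi)
    sensitivity = math.sqrt(k)  # L1 norm of each column of phi
    exact = [sum(p * d for p, d in zip(row, data)) for row in phi]
    return [y + laplace(sensitivity / eps, rng) for y in exact]


rng = random.Random(1)
phi = bernoulli_matrix(16, 5, rng)
y_star = noisy_samples(phi, [1.0, 0.0, 2.0, 0.0, 0.0], eps=1.0, rng=rng)
```

Changing a single entry of the database by at most 1 moves the exact samples by at most one column of the matrix, so the $L_1$ change is at most $\sqrt{k}$, matching the noise scale $\sqrt{k}/\varepsilon$.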

B Proof of Lemma 5

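Each $e[i]$ is distributed according to

$$\mathrm{Lap}(\sqrt{k}/\varepsilon) = \mathrm{Lap}(\lambda).$$

We define a random variable $Y = \sum_{i=1}^{k} e[i]^2$. We can compute that $E(Y) = 2k\lambda^2$ and that $\mathrm{Var}(Y) = 20k\lambda^4$. By the Chebyshev inequality,^2

$$\Pr\left(|Y| < 2\lambda^2\left(k + \sqrt{5k/\delta}\right)\right) \ge 1 - \delta.$$

Along with the fact that $\|e\|_2 = \sqrt{Y}$, we see that with probability at least $1 - \delta$,

$$\|e\|_2 = O(k/(\varepsilon\delta^{1/4})).$$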
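
The moment computations behind this bound are short enough to spell out. The derivation below is a sketch under the assumptions already in force (i.i.d. Laplace noise with scale $\lambda = \sqrt{k}/\varepsilon$ and $\delta \le 1$); it uses the standard fact that a $\mathrm{Lap}(\lambda)$ variable has absolute moments $E|e[i]|^m = m!\,\lambda^m$.

```latex
\begin{align*}
E\bigl(e[i]^2\bigr) &= 2\lambda^2, \qquad E\bigl(e[i]^4\bigr) = 24\lambda^4, \\
\mathrm{Var}\bigl(e[i]^2\bigr) &= 24\lambda^4 - (2\lambda^2)^2 = 20\lambda^4,
  \qquad\text{so}\quad \mathrm{Var}(Y) = 20k\lambda^4, \\
\Pr\bigl(|Y - 2k\lambda^2| \ge t\bigr) &\le \frac{20k\lambda^4}{t^2} = \delta
  \qquad\text{for}\quad t = 2\lambda^2\sqrt{5k/\delta}, \\
\|e\|_2 = \sqrt{Y} &< \lambda\sqrt{2\bigl(k + \sqrt{5k/\delta}\bigr)}
  \le \frac{\sqrt{2k}}{\varepsilon}\Bigl(\sqrt{k} + (5k/\delta)^{1/4}\Bigr)
  = O\!\left(\frac{k}{\varepsilon\,\delta^{1/4}}\right).
\end{align*}
```

The last step uses $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ together with $k^{3/4} \le k$ and $\delta \le 1$.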

Recall that with probability at least $1 - \exp(-k)$, $A = \Phi\Psi \in \mathbb{R}^{k \times n}$ satisfies RIP. Using a union bound and Corollary 1, with high probability,

$$\|D - D^*\|_2 = O(\log(n)/\varepsilon).$$

C Proof of Theorem 2

The $\varepsilon$-differential pan-privacy of CMCO is straightforward: at time $t$, we estimate $\sum_{j=1}^{t} m_i[j]$ in an $\varepsilon$-differentially pan-private way, and any operation after this step preserves pan-privacy.

Suppose that at each time $t$, each entry $v_t[i]$ of the vector $v_t \in \mathbb{R}^k$ represents the sum of the first $t$ entries in $m_i \in \mathbb{R}^t$. We apply a continual mechanism to each $m_i \in \mathbb{R}^t$. For two neighboring inputs $D_t, D_t' \in \mathbb{R}^t$ with $\|D_t - D_t'\|_1 \le 1$, the corresponding $m_i, m_i' \in \mathbb{R}^t$ satisfy $\|m_i - m_i'\|_1 \le 1/\sqrt{k}$. Combining this fact with Lemma 3, we know that with probability at least $1 - \beta$,

$$|v_t[i] - v_t^*[i]| = O(\log(1/\beta)\log^{1.5}(T)/(\varepsilon\sqrt{k})),$$

for each $i \in [1, \ldots, k]$. By a union bound, with probability at least $1 - \beta$,

$$|v_t[i] - v_t^*[i]| = O(\log(k/\beta)\log^{1.5}(T)/(\varepsilon\sqrt{k}))$$



holds for all $i$ simultaneously. Along with the Chebyshev inequality, we have that with probability at least $1 - \beta - \delta$,

$$\|v_t - v_t^*\|_2 = O(\log(k/\beta)\log^{1.5}(T)/(\varepsilon\delta^{1/2})).$$

Then by Corollary 1 and a union bound, we obtain the result that with high probability,

$$\|D_t - D_t^*\|_2 = O(\log^{1.5}(T)/\varepsilon).$$

^2 The Chernoff bound cannot be applied, as the integral defining the moment-generating function of $Y$ does not converge unconditionally in a neighborhood of 0.

D More on Reconstruction Process 

In the paper, we used the reconstruction process only in a black-box manner. Here we discuss the two reconstruction algorithms mentioned [7, 38] more concretely.
Recall the equation we have mentioned:

$$y^* = Ax + e, \qquad (2)$$

where $A \in \mathbb{R}^{k \times n}$, $e, y^* \in \mathbb{R}^k$ and $x \in \mathbb{R}^n$. $A$ and $y^*$ are known, $e$ is some unknown noise, and we want to know what $x$ is (reconstruct $x$).

What [7] does is to solve the convex program

$$\min_{x^* \in \mathbb{R}^n} \|x^*\|_1 \quad \text{s.t.} \quad \|Ax^* - y^*\|_2 \le \theta,$$

where $\theta$ bounds the amount of noise $e$.

[38] is more involved and uses a greedy algorithm to solve (2). The essential idea is to iteratively approximate the target vector $x$. At each step, the current approximation induces a residual, the part of the target vector that has not been approximated. The samples are updated so that they reflect the current residual and are used to construct a proxy for the residual, allowing us to identify the large components in the residual. This step yields a tentative support for the next approximation, and the samples are used to estimate the approximation on this support set using least squares. The process is repeated until the recoverable energy in the vector has been found.

E More on Continual Mechanism 

We also used the continual mechanism [23] in a black-box way in the body of the paper. Here we discuss the content of the mechanism.

The continual mechanism receives an input $\sigma(t) \in \{0, 1\}$ at each time $t \in [T]$, and outputs an approximation to the number of 1's seen in the prefix of length $t$.

There is a preprocessing step: segmentation. For $i \in [\log T]$ (assume $T$ to be a power of 2 without loss of generality), associate with each string $s \in \{0, 1\}^i$ the time segment $S$ of $2^{\log T - i}$ time periods $\{s \circ 0^{\log T - i}, \ldots, s \circ 1^{\log T - i}\}$. The segment begins at time $s \circ 0^{\log T - i}$ and ends at time $s \circ 1^{\log T - i}$.

Then come the processing steps. At each time $t \in [T]$, the continual mechanism maintains the $\log T$ segments that contain $t - 1$. Each of these $\log T$ segments has noise associated with it, sampled from $\mathrm{Lap}((1 + \log T)/\varepsilon)$. The output at each time $t \in [T]$ is the sum of the noise from these $\log T$ segments and the count (the true count plus random noise sampled from $\mathrm{Lap}((1 + \log T)/\varepsilon)$).
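
To convey the role of the dyadic segments, the following Python sketch implements a simpler, related construction: the standard binary-tree counter, in which every dyadic segment receives one noisy partial sum and each prefix count is assembled from at most $1 + \log T$ of them. This is a swapped-in illustration and not the mechanism of [23]: it reads the raw stream directly, so it is not pan-private, and the names `dyadic_segments` and `continual_counts` are hypothetical.

```python
import math
import random


def laplace(scale, rng):
    """Draw from Lap(scale) as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dyadic_segments(t):
    """Split [1, t] into disjoint dyadic segments (start, end), largest first."""
    segments, start = [], 1
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:
            segments.append((start, start + size - 1))
            start += size
    return segments


def continual_counts(stream, eps, noise=None, seed=None):
    """Return noisy prefix counts of a 0/1 stream, one per time step."""
    total_steps = len(stream)
    rng = random.Random(seed)
    # Upper bound on the number of segments containing any single time step.
    levels = 1 + math.ceil(math.log2(total_steps)) if total_steps > 1 else 1
    scale = levels / eps  # plays the role of (1 + log T) / eps
    if noise is None:
        noise = lambda: laplace(scale, rng)
    noisy_sum = {}  # one noisy partial sum per segment, drawn once and reused
    outputs = []
    for t in range(1, total_steps + 1):
        total = 0.0
        for seg in dyadic_segments(t):
            if seg not in noisy_sum:
                a, b = seg
                noisy_sum[seg] = sum(stream[a - 1:b]) + noise()
            total += noisy_sum[seg]
        outputs.append(total)
    return outputs


stream = [1, 0, 1, 1, 0, 1, 1]
exact = continual_counts(stream, eps=1.0, noise=lambda: 0.0)
noisy = continual_counts(stream, eps=1.0, seed=0)
```

With the noise switched off, the outputs coincide with the true prefix counts, which confirms that the segments tile each prefix exactly; with noise, each output contains only logarithmically many noise terms.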